suppressPackageStartupMessages({
library(piecewiseSEM)
library(magrittr)
library(tidyverse)
library(stargazer)
library(broom)
})
storks <- read.csv("Data/storks.csv")
This course was part of the pre-conference workshops at the International Congress for Conservation Biology, in Cartagen Colombia. The course was given by Dr. Achaz von Hardenberg @achazhardenberg.
Subjects on causal inference, path analalysis and bayesian inference were covered.
The scientist’s mantra: “Correlation does not imply causation”
ggplot(data = storks, mapping=aes(x = Storks, y = Birth)) +
geom_point() +
geom_smooth(method = "lm", color = "black") +
theme_bw() +
labs(x = "Number of breeding stork pairs", y = "Human birth rate (thousands/year)")
lm(Birth ~ Storks, data = storks) %>%
stargazer(single.row = T, type = "html", dep.var.labels = "Human birth rate *thousands / year)")
| Dependent variable: | |
| Human birth rate *thousands / year) | |
| Storks | 0.029*** (0.009) |
| Constant | 225.029** (93.561) |
| Observations | 17 |
| R2 | 0.385 |
| Adjusted R2 | 0.344 |
| Residual Std. Error | 332.185 (df = 15) |
| F Statistic | 9.380*** (df = 1; 15) |
| Note: | p<0.1; p<0.05; p<0.01 |
Other variables appear to be correlated, so we can inspect their correlation and include relevant ones in the model.
library(corrplot)
corrplot(cor(storks[,-1]), method="ellipse", type="lower")
lm1 <- lm(Birth ~ Storks, storks)
lm2 <- lm(Area ~ Storks, storks)
lm3 <- lm(Birth ~ Area, storks)
lm4 <- lm(Birth ~ Area + Storks, storks)
stargazer(lm1, lm2, lm3, lm4, type ="html", single.row = T, add.lines = list(AIC = c("AIC", formatC(AIC(lm1, lm2, lm3, lm4)[,2], digits = 2, format = "f"))))
| Dependent variable: | ||||
| Birth | Area | Birth | ||
| (1) | (2) | (3) | (4) | |
| Storks | 0.029*** (0.009) | 14.400** (5.231) | 0.006 (0.006) | |
| Area | 0.002*** (0.0002) | 0.002*** (0.0002) | ||
| Constant | 225.029** (93.561) | 146,817.600** (52,057.720) | -7.775 (56.938) | -7.412 (56.702) |
| AIC | 249.51 | 464.44 | 225.39 | 226.08 |
| Observations | 17 | 17 | 17 | 17 |
| R2 | 0.385 | 0.336 | 0.851 | 0.862 |
| Adjusted R2 | 0.344 | 0.291 | 0.841 | 0.842 |
| Residual Std. Error | 332.185 (df = 15) | 184,830.100 (df = 15) | 163.422 (df = 15) | 162.744 (df = 14) |
| F Statistic | 9.380*** (df = 1; 15) | 7.578** (df = 1; 15) | 85.731*** (df = 1; 15) | 43.787*** (df = 2; 14) |
| Note: | p<0.1; p<0.05; p<0.01 | |||
After including Area, for example, we observe that the effect of Storks significantly decreases.
Correlation imples and unresolved causal structure
Once we have identified processes, we can start to imply causation:
Causality always imples a completely resolved correlation structure
Limitations of commonly employed statistical methods (i.e. multiple linear regression):
Can only analyze one dependent variable at a time
A particular variable can either be a predictor or a response
“Path analysis is an extension of multiple regression, and was developed to overcome these limitations”
In path analysis, we call each variable a vertex (or vertices, in plural) and their connections are called edges. Directed graphs include directions (i.e. the edges are arrow that define direction). In Indirect graphs, edges only connect vertices, but do not specify the direction of influence.
A full explanation on the grammar of path analysis is found in Achaz’s materials
From his model:
Causal path of 6 variables
A is:
direct cause of B
indirect cause of C and F
B is:
C and D are:
indirectly independent on A and F
directly dependent on B and E
E is:
F is:
direct cause of E
Indirect cause of C and F
In the case above, B is an active vertex, becauser it is bot an effect (of A) and a cause (of C or D). Similarly, D is an inactive vertez, which is only an effect of 1 or more other vertices, but not a cause of anything. Often, we try to understand relationships between D and A, leaving out B.
Drawa causal graph with 6 variables:
(fromAtoF)where:
A is direct cause of B
C is a causal child of B
D is directly dependent on B and C
F is directly dependent on D and E
DiagrammeR::grViz("
digraph boxes_and_circles{
# Define nodes
node [shape = square
penwidth = 2,
fontsize = 24]
A; B; C; D; E; F
# Add edge statements
#Advaance classes
A -> B
B -> C
C -> D
B -> D
D -> F
E -> F
}
", height = 500)
To start transforming this into statistical models, we perform d-separation:
d_separ <- data.frame(Pairs = c("C,D","A,C", "A,D", "A,E", "A,F", "B,E","B,F", "C,F","D,F"))
d_separ
## Pairs
## 1 C,D
## 2 A,C
## 3 A,D
## 4 A,E
## 5 A,F
## 6 B,E
## 7 B,F
## 8 C,F
## 9 D,F
d_separ <- data.frame(Pairs = c("C,D","A,C", "A,D", "A,E", "A,F", "B,E","B,F", "C,F","D,F"),
Parents = c("B,E", "B", "B,E", "F", "None", "A, F", "A", "B,E", "B,E"))
d_separ
## Pairs Parents
## 1 C,D B,E
## 2 A,C B
## 3 A,D B,E
## 4 A,E F
## 5 A,F None
## 6 B,E A, F
## 7 B,F A
## 8 C,F B,E
## 9 D,F B,E
d_separ %>%
mutate(D_separation = paste0("(", Pairs, ")", ", {", Parents, "}"))
## Pairs Parents D_separation
## 1 C,D B,E (C,D), {B,E}
## 2 A,C B (A,C), {B}
## 3 A,D B,E (A,D), {B,E}
## 4 A,E F (A,E), {F}
## 5 A,F None (A,F), {None}
## 6 B,E A, F (B,E), {A, F}
## 7 B,F A (B,F), {A}
## 8 C,F B,E (C,F), {B,E}
## 9 D,F B,E (D,F), {B,E}
The number of elements in the basis test should be given by
\[\frac{V!}{2(V-2)!}-A\]
In R code, we can define a function for this as:
numbers <- function(V, A){
n <- (factorial(V)/(2*factorial((V-2))))-A
return(n)
}
Where:
-A is the number of arrows
d_separ %>%
mutate(D_separation = paste0("(", Pairs, ")", ", {", Parents, "}"),
Claims = c("D ~ B + E + C", "C ~ B + A", "D ~ B + E + A", "E ~ F + A", "F ~ A", "E ~ A + F + B", "F ~ B + A", "F ~ B + E + C", "F ~ B + E + D"))
## Pairs Parents D_separation Claims
## 1 C,D B,E (C,D), {B,E} D ~ B + E + C
## 2 A,C B (A,C), {B} C ~ B + A
## 3 A,D B,E (A,D), {B,E} D ~ B + E + A
## 4 A,E F (A,E), {F} E ~ F + A
## 5 A,F None (A,F), {None} F ~ A
## 6 B,E A, F (B,E), {A, F} E ~ A + F + B
## 7 B,F A (B,F), {A} F ~ B + A
## 8 C,F B,E (C,F), {B,E} F ~ B + E + C
## 9 D,F B,E (D,F), {B,E} F ~ B + E + D
We then use Fisher’s C test to test the composite probability of the whole set.
With a model of the sape:
DiagrammeR::grViz("
digraph boxes_and_circles{
# Define nodes
node [shape = square
penwidth = 2,
fontsize = 24]
A; S; B; I
# Add edge statements
#Advaance classes
A -> S
A -> B
B -> I
}
", height = 500)
Where:
A = Area
S = Storks
B = Birth rate
I = Inhabitants
Using the formula above, we know that we need to identify 3 pairs.
Using the steps above, we must identify all non-adjacent pairs
storke_pairs <- data.frame(Pairs = c("A, I", "S, B", "S,I"))
storke_pairs
## Pairs
## 1 A, I
## 2 S, B
## 3 S,I
Identify all parents of each pair
storke_pairs <- data.frame(Pairs = c("A, I", "S, B", "S,I"),
Parents = c("B", "A", "A, B"))
storke_pairs
## Pairs Parents
## 1 A, I B
## 2 S, B A
## 3 S,I A, B
The d-separation statements are then:
storke_pairs %>%
mutate(D_separation = paste0("(", Pairs, ")", ", {", Parents, "}"))
## Pairs Parents D_separation
## 1 A, I B (A, I), {B}
## 2 S, B A (S, B), {A}
## 3 S,I A, B (S,I), {A, B}
Test the probabilistic conditional dependence of every d-statement with the following claims:
storke_pairs %>%
mutate(D_separation = paste0("(", Pairs, ")", ", {", Parents, "}"),
Claims = c("I ~ B + A", "B ~ A + S", "I ~ A + B + I"))
## Pairs Parents D_separation Claims
## 1 A, I B (A, I), {B} I ~ B + A
## 2 S, B A (S, B), {A} B ~ A + S
## 3 S,I A, B (S,I), {A, B} I ~ A + B + I
Evaluate the models:
colnames(storks) <- c("Country", "A", "S", "I", "B")
path1 <- lm(I ~ B + A, storks)
path2 <- lm(B ~ A + S, storks)
path3 <- lm(I ~ A + B + S, storks)
stargazer(path1, path2, path3, type = "html", report = c("vcp"), single.row = T)
| Dependent variable: | |||
| I | B | I | |
| (1) | (2) | (3) | |
| B | 0.039 | 0.049 | |
| p = 0.079 | p = 0.032 | ||
| A | 0.00002 | 0.002 | 0.00002 |
| p = 0.624 | p = 0.00001 | p = 0.575 | |
| S | 0.006 | -0.001 | |
| p = 0.307 | p = 0.111 | ||
| Constant | 6.744 | -7.412 | 6.771 |
| p = 0.162 | p = 0.898 | p = 0.138 | |
| Observations | 17 | 17 | 17 |
| R2 | 0.729 | 0.862 | 0.779 |
| Adjusted R2 | 0.691 | 0.842 | 0.728 |
| Residual Std. Error | 13.084 (df = 14) | 162.744 (df = 14) | 12.264 (df = 13) |
| F Statistic | 18.872*** (df = 2; 14) | 43.787*** (df = 2; 14) | 15.297*** (df = 3; 13) |
| Note: | p<0.1; p<0.05; p<0.01 | ||
Using Fisher’s C:
\[C = -2\sum_{i = 1}^{k{}} log(p_i)\] Where:
C follows a \(\chi^2\) distribution, with 2k degrees of freedom: 1-pchisq(C,2*k)
extract_p <- function(model, coeff){
tidy(model) %>%
filter(term == coeff) %>%
select(p.value)
}
ppath1 <- extract_p(path1, "A")
ppath2 <- extract_p(path2, "S")
ppath3 <- extract_p(path3, "S")
C <- rbind(ppath1, ppath2, ppath3) %>%
mutate(logs = log(p.value)) %$%
sum(logs)
C <- -2*C
test <- 1-pchisq(C,2*3)
Thus, the probability is p = 0.2597204.
piecewiseSEMHere, we only specify the dependent paths as within an lm() call. For example, in this model, I is caused by B, so we call lm(B~I). We include each lm call inside a list, and pass that to sem.fit, which will identify the missing paths, and do all the calculations.
storks_model <- list(
lm(S ~ A, storks),
lm(B ~ A, storks),
lm(I ~ B, storks)) %>%
sem.fit(data = storks, .progressBar = F, conditional = T) %>%
knitr::kable()